
The PLLuM Instruction Corpus

Pęzik, Piotr, Żarnecki, Filip, Kaczyński, Konrad, Cichosz, Anna, Deckert, Zuzanna, Garnys, Monika, Grabarczyk, Izabela, Janowski, Wojciech, Karasińska, Sylwia, Kujawiak, Aleksandra, Misztela, Piotr, Szymańska, Maria, Walkusz, Karolina, Siek, Igor, Chrabąszcz, Maciej, Kołos, Anna, Karlińska, Agnieszka, Seweryn, Karolina, Krasnodębska, Aleksandra, Betscher, Paula, Cieślińska, Zofia, Kowol, Katarzyna, Wilczek, Artur, Trzciński, Maciej, Dziewulska, Katarzyna, Roszko, Roman, Bernaś, Tomasz, Vaičenonienė, Jurgita, Roszko, Danuta, Levchuk, Paweł, Kowalski, Paweł, Prawdzic-Jankowska, Irena, Kozłowski, Marek, Dadas, Sławomir, Poświata, Rafał, Wróblewska, Alina, Krasnowska-Kieraś, Katarzyna, Ogrodniczuk, Maciej, Rudolf, Michał, Rybak, Piotr, Saputa, Karolina, Wołoszyn, Joanna, Oleksy, Marcin, Koptyra, Bartłomiej, Ferdinan, Teddy, Woźniak, Stanisław, Piasecki, Maciej, Walkowiak, Paweł, Wojtasik, Konrad, Janz, Arkadiusz, Kazienko, Przemysław, Moska, Julia, Kocoń, Jan

arXiv.org Artificial Intelligence

This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.


Are Foundation Models Useful for Bankruptcy Prediction?

Kostrzewa, Marcin, Furman, Oleksii, Furman, Roman, Tomczak, Sebastian, Zięba, Maciej

arXiv.org Artificial Intelligence

Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrád Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.


The Correspondence Between Bounded Graph Neural Networks and Fragments of First-Order Logic

Grau, Bernardo Cuenca, Feng, Eva, Wałęga, Przemysław A.

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) address two key challenges in applying deep learning to graph-structured data: they handle varying size input graphs and ensure invariance under graph isomorphism. While GNNs have demonstrated broad applicability, understanding their expressive power remains an important question. In this paper, we propose GNN architectures that correspond precisely to prominent fragments of first-order logic (FO), including various modal logics as well as more expressive two-variable fragments. To establish these results, we apply methods from finite model theory of first-order and modal logics to the domain of graph representation learning. Our results provide a unifying framework for understanding the logical expressiveness of GNNs within FO.


LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models

Pęzik, Piotr, Kaczyński, Konrad, Szymańska, Maria, Żarnecki, Filip, Deckert, Zuzanna, Kwiatkowski, Jakub, Janowski, Wojciech

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM's training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.
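The boundary-estimation idea can be sketched as follows; the probe dates, answers, and function name are invented for illustration, and the real benchmark uses curated recent-events questions rather than this toy data.

```python
from datetime import date

# Hypothetical probe results: for each dated event, whether the model
# answered a question about it correctly.
probe_results = [
    (date(2023, 6, 1), True),
    (date(2023, 11, 15), True),
    (date(2024, 2, 10), True),
    (date(2024, 5, 20), False),
    (date(2024, 8, 3), False),
    (date(2024, 12, 1), False),
]

def earliest_probable_cutoff(results):
    """Return the date of the most recent event the model still knows.

    Misses on events after this date suggest they fall beyond the
    model's training-data boundary.
    """
    known = [d for d, correct in sorted(results) if correct]
    return max(known) if known else None

print(earliest_probable_cutoff(probe_results))  # 2024-02-10
```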


PLLuM: A Family of Polish Large Language Models

Kocoń, Jan, Piasecki, Maciej, Janz, Arkadiusz, Ferdinan, Teddy, Radliński, Łukasz, Koptyra, Bartłomiej, Oleksy, Marcin, Woźniak, Stanisław, Walkowiak, Paweł, Wojtasik, Konrad, Moska, Julia, Naskręt, Tomasz, Walkowiak, Bartosz, Gniewkowski, Mateusz, Szyc, Kamil, Motyka, Dawid, Banach, Dawid, Dalasiński, Jonatan, Rudnicka, Ewa, Alberski, Bartłomiej, Walkowiak, Tomasz, Szczęsny, Aleksander, Markiewicz, Maciej, Bernaś, Tomasz, Mazur, Hubert, Żyta, Kamil, Tykierko, Mateusz, Chodak, Grzegorz, Kajdanowicz, Tomasz, Kazienko, Przemysław, Karlińska, Agnieszka, Seweryn, Karolina, Kołos, Anna, Chrabąszcz, Maciej, Lorenc, Katarzyna, Krasnodębska, Aleksandra, Wilczek, Artur, Dziewulska, Katarzyna, Betscher, Paula, Cieślińska, Zofia, Kowol, Katarzyna, Mikoś, Daria, Trzciński, Maciej, Krutul, Dawid, Kozłowski, Marek, Dadas, Sławomir, Poświata, Rafał, Perełkiewicz, Michał, Grębowiec, Małgorzata, Kazuła, Maciej, Białas, Marcin, Roszko, Roman, Roszko, Danuta, Vaičenonienė, Jurgita, Utka, Andrius, Levchuk, Paweł, Kowalski, Paweł, Prawdzic-Jankowska, Irena, Ogrodniczuk, Maciej, Borys, Monika, Bulińska, Anna, Gumienna, Wiktoria, Kieraś, Witold, Komosińska, Dorota, Krasnowska-Kieraś, Katarzyna, Kobyliński, Łukasz, Lewandowska, Martyna, Łaziński, Marek, Łątkowski, Mikołaj, Mastalerz, Dawid, Milewicz, Beata, Mykowiecka, Agnieszka Anna, Peljak-Łapińska, Angelika, Penno, Sandra, Przybysz, Zuzanna, Rudolf, Michał, Rybak, Piotr, Saputa, Karolina, Tomaszewska, Aleksandra, Wawer, Aleksander, Woliński, Marcin, Wołoszyn, Joanna, Wróblewska, Alina, Żuk, Bartosz, Żarnecki, Filip, Kaczyński, Konrad, Cichosz, Anna, Deckert, Zuzanna, Garnys, Monika, Grabarczyk, Izabela, Janowski, Wojciech, Karasińska, Sylwia, Kujawiak, Aleksandra, Misztela, Piotr, Szymańska, Maria, Walkusz, Karolina, Siek, Igor, Kwiatkowski, Jakub, Pęzik, Piotr

arXiv.org Artificial Intelligence

Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.


Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Szolnoky, Kelvin, Blilie, Anders, Mulliqi, Nita, Tsuzuki, Toyonori, Samaratunga, Hemamali, Titus, Matteo, Ji, Xiaoyi, Boman, Sol Erika, Gudlaugsson, Einar, Kjosavik, Svein Reidar, Asenjo, José, Gambacorta, Marcello, Libretti, Paolo, Braun, Marcin, Kordek, Radisław, Łowicki, Roman, Delahunt, Brett, Iczkowski, Kenneth A., van der Kwast, Theo, van Leenders, Geert J. L. H., Leite, Katia R. M., Pan, Chin-Chen, Janssen, Emiel Adrianus Maria, Eklund, Martin, Egevad, Lars, Kartasalo, Kimmo

arXiv.org Artificial Intelligence

Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.


Integrating Domain Knowledge into Process Discovery Using Large Language Models

Norouzifar, Ali, Kourani, Humam, Dees, Marcus, van der Aalst, Wil

arXiv.org Artificial Intelligence

Process discovery aims to derive process models from event logs, providing insights into operational behavior and forming a foundation for conformance checking and process improvement. However, models derived solely from event data may not accurately reflect the real process, as event logs are often incomplete or affected by noise, and domain knowledge, an important complementary resource, is typically disregarded. As a result, the discovered models may lack reliability for downstream tasks. We propose an interactive framework that incorporates domain knowledge, expressed in natural language, into the process discovery pipeline using Large Language Models (LLMs). Our approach leverages LLMs to extract declarative rules from textual descriptions provided by domain experts. These rules are used to guide the IMr discovery algorithm, which recursively constructs process models by combining insights from both the event log and the extracted rules, helping to avoid problematic process structures that contradict domain knowledge. The framework coordinates interactions among the LLM, domain experts, and a set of backend services. We present a fully implemented tool that supports this workflow and conduct an extensive evaluation of multiple LLMs and prompt engineering strategies. Our empirical study includes a case study based on a real-life event log with the involvement of domain experts, who assessed the usability and effectiveness of the framework.
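A minimal sketch of checking event-log traces against Declare-style rules of the kind an LLM might extract from an expert's description; the rule names follow standard Declare templates, but the log, activity names, and helper functions are invented here.

```python
def satisfies_response(trace, a, b):
    """response(a, b): every occurrence of a is eventually followed by b."""
    return all(b in trace[i + 1:] for i, act in enumerate(trace) if act == a)

def satisfies_precedence(trace, a, b):
    """precedence(a, b): b may only occur after some earlier a."""
    return all(a in trace[:i] for i, act in enumerate(trace) if act == b)

log = [
    ["register", "check", "approve", "notify"],
    ["register", "check", "reject"],   # violates response(check, approve)
    ["check", "approve"],              # violates precedence(register, check)
]

rules = [("response", "check", "approve"), ("precedence", "register", "check")]
checkers = {"response": satisfies_response, "precedence": satisfies_precedence}

# Traces whose rule violations survive this check would steer the
# discovery algorithm away from contradicting domain knowledge.
for trace in log:
    violated = [r for r in rules if not checkers[r[0]](trace, r[1], r[2])]
    print(trace, "violations:", violated)
```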


Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions

Khapre, Smita, Mersha, Melkamu Abay, Shakil, Hassan, Baruah, Jonali, Kalita, Jugal

arXiv.org Artificial Intelligence

The evolution of digital communication systems and the designs of online platforms have inadvertently facilitated the subconscious propagation of toxic behavior, giving rise to largely reactive responses. Toxicity in online content and Artificial Intelligence systems has become a serious challenge to individual and collective well-being around the world, and it is more detrimental to society than we realize. Toxicity, expressed in language, image, and video, can be interpreted in various ways depending on the context of usage. Therefore, a comprehensive taxonomy is crucial for detecting and mitigating toxicity in online content, Artificial Intelligence systems, and Large Language Models in a proactive manner. A comprehensive understanding of toxicity is likely to facilitate the design of practical solutions for toxicity detection and mitigation. The classification in published literature has focused on only a limited number of aspects of this very complex issue, with a pattern of reactive strategies in response to toxicity. This survey attempts to generate a comprehensive taxonomy of toxicity from various perspectives. It presents a holistic approach to explaining toxicity by understanding the context and environment that society is facing in the Artificial Intelligence era. This survey summarizes toxicity-related datasets and research on toxicity detection and mitigation for Large Language Models, social media platforms, and other online platforms, detailing their attributes for textual content, with a focus on the English language. Finally, we identify research gaps in toxicity mitigation related to datasets, mitigation strategies, Large Language Models, adaptability, explainability, and evaluation.


Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models

Walczak, Jakub, Tomalak, Piotr, Laskowski, Artur

arXiv.org Artificial Intelligence

Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending the context to the full implementation yields markedly smaller gains. Notably, the chain-of-thought prompting strategy, applied even to 'reasoning' models, achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and a near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage while also remaining among the top models in compilation success rate. Recent years have brought significant advancements in artificial intelligence (AI), particularly in the areas of performance and productivity enhancement. However, AI, and large language models (LLMs) in particular, still suffers from several weaknesses. Among them, convincing but senseless content generation ('hallucination'), safety misalignment ('ethicality') [1], unfairness [2], and limited processing context are the most critical. In spite of these restrictions, and bearing in mind the limited and merely apparent creativity of LLMs [3], they have become versatile tools already widely used across a variety of domains (creative industries [4], entertainment, reporting, and software engineering [5] are just cases in point) for multiple tasks.
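The two experimental factors named in the abstract, code context level and prompting strategy, can be sketched as a prompt-assembly step; the function name, template wording, and context labels below are invented for illustration and are not the paper's actual prompts.

```python
def build_prompt(focal_code, context_level="signature", strategy="direct"):
    """Assemble a unit-test-generation prompt for an LLM.

    context_level: "signature" | "docstring" | "full" -- how much of the
    focal method is shown to the model.
    strategy: "direct" | "cot" -- chain-of-thought asks the model to
    reason about branches and edge cases before emitting tests.
    """
    parts = ["Write pytest unit tests for the following Python code."]
    if strategy == "cot":
        parts.append(
            "First, list the branches and edge cases the tests must cover. "
            "Then write the tests."
        )
    parts.append(f"Context level provided: {context_level}")
    parts.append(f"```python\n{focal_code}\n```")
    return "\n\n".join(parts)

focal = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))"
prompt = build_prompt(focal, context_level="docstring", strategy="cot")
print(prompt)
```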


SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

Perełkiewicz, Michał, Dadas, Sławomir, Poświata, Rafał

arXiv.org Artificial Intelligence

This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using a semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. An extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with supervised methods and achieves state-of-the-art results among unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of aspects of auto-generated paraphrase evaluation. At the same time, it highlights the low reliability of the metrics widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.